Accelerating Science Discovery - Join the Discussion

OSTIblog Articles in the federated search Topic

Federated Search - The Wave of the Future?: Part 1

by Dr. Walt Warnick 12 Mar, 2008 in Technology

by Walt Warnick and Sol Lederman

The web is growing.

Google and the other web crawlers can't keep up in providing searchable access to the content that matters most to scientists and researchers. Instead, growing numbers of scientists, researchers, and science-attentive citizens turn to OSTI's federated search applications for high-quality research material that Google can't find. And, given fundamental limitations on how web crawlers find content, those conducting research will derive even more benefit from OSTI's innovation and investment in federated search in the coming years.

This is the first of three articles that discuss and compare the strengths and weaknesses of two web search architectures: the crawling and indexing architecture as used today by Google and the federated search architecture used by Science.gov and WorldWideScience.org. This article points out the limitations of the crawling architecture for serious researchers. The second article explains how federated search overcomes these obstacles. The third article highlights a number of OSTI's federated search offerings that advance science, and suggests that federated search may someday become the dominant web search architecture. 

Google is a "surface web" crawler; it discovers content by taking a list of known web pages and following links to new web pages and to documents. This approach finds only documents that have links referencing them. It finds none of the content in the "deep web," which holds the majority of web content.
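As a rough illustration of the link-following approach described above, here is a minimal sketch in Python. The toy link graph, the page names, and the `crawl` function are all hypothetical; this is not Google's actual crawler, only the general idea of starting from known pages and following links.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first crawl: return every page reachable by following links from the seeds."""
    discovered = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for target in link_graph.get(page, []):
            if target not in discovered:
                discovered.add(target)
                frontier.append(target)
    return discovered

# Hypothetical toy web: keys are pages, values are the links found on each page.
toy_web = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["paper-1.pdf"],
    "paper-1.pdf": [],
    # A database record served only in response to a search-form query.
    # No page links to it, so link-following never reaches it.
    "db-record-42": [],
}

found = crawl(toy_web, ["home"])
```

In this toy example, `found` contains the four linked pages but not `db-record-42`, which stands in for deep web content that no link references.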

The deep web...

Related Topics: doe, federated search, osti, web crawling

Read more...

Sophisticated Yet Simple - The Technology Behind OSTI's E-print Network: Part 2

by Sol Lederman 05 Mar, 2008 in Technology

In Part 1 of this series I provided an overview of the technology that drives the E-print Network. In this article I will provide some detail about how the harvested collection, the "E-prints on Web Sites" component of the E-print Network, is constructed. In Part 3, I will discuss the technology of the portion of the E-print Network that relies on federated search of databases.

In Part 1 I explained that the E-print Network combines federated sources searched in real-time with harvested content. The harvested content, consisting of over 1.3 million e-prints, is found by directing a crawler to 28,000 web sites belonging to scientists, researchers, and members of the academic community. In OSTI terminology, harvesting is synonymous with conducting a directed crawl of web sites.
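To make the idea of a directed crawl concrete, here is a minimal sketch, assuming a crawl that stays on one known site and keeps only links to document files. The file extensions, the example URLs, and the function names are assumptions made for illustration; OSTI's custom crawler is not described in enough detail here to reproduce.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of file extensions that count as e-prints in this sketch.
EPRINT_EXTENSIONS = (".pdf", ".ps")

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def harvest_eprint_links(index_url, index_html):
    """Return absolute URLs of document files hosted on the same site as the index page."""
    collector = LinkCollector()
    collector.feed(index_html)
    host = urlparse(index_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(index_url, href)
        parsed = urlparse(absolute)
        same_host = parsed.netloc == host
        if same_host and parsed.path.lower().endswith(EPRINT_EXTENSIONS):
            links.append(absolute)
    return links

# Hypothetical index page on a researcher's web site.
sample_html = """
<ul>
  <li><a href="papers/quarks.pdf">Quarks</a></li>
  <li><a href="/preprints/lattice.ps">Lattice</a></li>
  <li><a href="http://other.example.org/x.pdf">Elsewhere</a></li>
  <li><a href="cv.html">CV</a></li>
</ul>
"""
links = harvest_eprint_links("http://physics.example.edu/~smith/index.html", sample_html)
```

Here `links` holds the two document files on the researcher's own site; the link to another host and the HTML page are skipped, which is what keeps the crawl "directed."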

Before we look at the technology behind the harvesting, let's consider the question of why the content is harvested at all. Why not search the contributors' web sites in real-time in the same way that other collections are searched in real-time via federated search? There are several reasons for harvesting the content. First, a large number of e-prints are not found in databases. They are predominantly stored as document files in web server directories. Accessing files stored this way is the job of a web crawler, not that of a federated search engine. This is because a crawler, once it locates the index page for a set of e-prints, can easily harvest all e-prints referenced in that index page. The second reason...

Related Topics: doe, E-Print Network (EPN), federated search, osti

Read more...

Sophisticated Yet Simple - The Technology Behind OSTI's E-print Network: Part 1

by Sol Lederman 26 Feb, 2008 in Technology

The E-print Network is one of OSTI's most popular and powerful research offerings, yet few of its users know about the advanced technology that drives it and makes it simple to use. Professional researchers in basic and applied science are able to access over 5 million e-prints gathered from nearly 28,000 databases and web sites worldwide. Numerous OSTI innovations ensure that the E-print Network's documents are of extremely high quality, are highly relevant to researchers, and are easy and quick to find. This is the first in a series of articles about the technology behind this very important component of the Science Accelerator. This article serves as an overview; subsequent articles will provide more technical information.

The E-print Network is a federated search application. From a single user query, it federates (aggregates) search results from over 50 content databases in a number of scientific disciplines. The E-print Network, however, uses federated search in an innovative way: one of the databases it searches is a special collection formed by harvesting over 1.3 million e-prints from nearly 28,000 hand-picked web sites. A custom-designed crawler performs the harvesting, and custom software builds an index of the 1.3 million e-prints so that they can be searched quickly together with the non-harvested databases. Most E-print Network users are unaware that the application is, in fact, a blend of federated search and Google-like crawling technologies. This marriage of the two technologies reflects OSTI's insight in realizing that e-prints not only reside in certain well...

Related Topics: doe, E-Print Network (EPN), federated search, osti, Science Accelerator

Read more...

The Role of Federated Search at OSTI

by Sol Lederman 19 Feb, 2008 in Technology

Federated search is very much at the heart of OSTI's ability to realize its mission. OSTI provides a simple description of what federated search is and how it works in the OSTI environment. The best way to experience the tremendous value of federated search at OSTI is to try several of OSTI's flagship applications:

These, like all federated search applications, search databases "live", which means there is no delay or "lag time" between when a collection is updated by its owner and when the new content can be searched. Science Accelerator provides searchable access to a number of science databases that OSTI manages. Its aim is to accelerate science discovery by greatly reducing the time and effort required for researchers to find relevant science information. Science.gov was OSTI's breakthrough federated search product; the first version was launched in December 2002. Science.gov provides access to more than 50 million pages of science information from 17 scientific and technical organizations via the collaboration of 13 federal agencies. WorldWideScience is a global science gateway to national and international scientific databases.
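The "live" behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory lists stand in for remote databases, and the `make_source` helper, the term-count scoring, and the merge order are all hypothetical rather than how OSTI's applications rank results.

```python
from concurrent.futures import ThreadPoolExecutor

def make_source(records):
    """Build a search function over a mutable list of titles (stand-in for a remote database)."""
    def search(query):
        q = query.lower()
        # Assumed scoring: number of times the query term appears in the title.
        return [(t.lower().count(q), t) for t in records if q in t.lower()]
    return search

def federated_search(query, sources):
    """Send one query to every source at search time and merge the results."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        results = []
        for name, future in futures.items():
            for score, title in future.result():
                results.append((score, title, name))
    # Rank merged results by descending score, breaking ties by title.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results

# Hypothetical collections owned by two different organizations.
energy_db = ["Fusion energy review", "Solar energy storage"]
health_db = ["Energy metabolism in cells"]
sources = {"energy": make_source(energy_db), "health": make_source(health_db)}

before = federated_search("energy", sources)
energy_db.append("Wind energy forecasting")  # the owner updates the collection
after = federated_search("energy", sources)
```

Because each source is queried at search time rather than from a previously built index, the record added between the two calls shows up in `after` without any re-crawl or re-index step.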

The technology used to mine content from the deep web is called "federated search." While federated search is not the only search technology...

Related Topics: federated search, osti, Science Accelerator, Science.gov, WorldWideScience.org (WWS)

Read more...
